Abstract

We consider various data-analysis queries on two-dimensional points. We give 
new space/time tradeoffs over previous work on geometric queries such as domi- 
nance and rectangle visibility, and on semigroup and group queries such as sum, 
average, variance, minimum and maximum. We also introduce new solutions 
to queries less frequently considered in the literature such as two-dimensional 
quantiles, majorities, successor/predecessor, mode, and various top-k queries,
considering static and dynamic scenarios.

1. Introduction

Multidimensional grids arise as natural representations to support conjunctive
queries in databases [BCKO08]. Typical queries such as "find all the employees
with age between x_0 and x_1 and salary between y_0 and y_1" translate into a
two-dimensional range reporting query on coordinates age and salary. More
generally, such a grid representation of the data is useful to carry out a
number of data analysis queries over large repositories. Both space and time
efficiency are important when analyzing the performance of data structures on
massive data. However, in cases of very large data volumes the space usage can
be even more important than the time needed to answer a query. More or less
space usage can make the difference between maintaining all the data in main
memory or having to resort to disk, which is orders of magnitude slower.

In this paper we study various problems on two-dimensional grids that are 
relevant for data analysis, focusing on achieving good time performance (usu- 
ally polylogarithmic) within the least possible space (even succinct for some 
problems). 

Counting the points in a two-dimensional range Q = [x_0, x_1] × [y_0, y_1],
i.e., computing Count(Q), is arguably the most primitive operation in data
analysis. Given n points on an n × n grid, one can compute Count in time
O(log n / log log n) using "linear" space, O(n) integers [Nek09]. This time is
optimal within space O(n polylog(n)) [Pat07], and it has been matched using
asymptotically minimum (i.e., succinct) space, n + o(n) integers, by Bose et
al. [BHMM09].

The k points in the range can be reported in time O(log log n + k) using
O(n log^ε n) integers, for any constant ε > 0 [ABR00]. This time is optimal
within space O(n polylog(n)), by reduction from the colored predecessor
problem [PT06]. With O(n) integers, the time becomes O((k + 1) log^ε n)
[CLP11]. Within n + o(n) integers space, one can achieve time
O(k log n / log log n) [BHMM09].

We start with two geometric problems, dominance and rectangular visibility.
These enable data analysis queries such as "find the employees with worst
productivity-salary combination within productivity range [x_0, x_1] and salary
range [y_0, y_1]", that is, such that no less productive employee earns more.

The best current result for the 4-sided variant of these problems (i.e., where
the points are limited by a general rectangle Q) is a dynamic structure by
Brodal and Tsakalidis [BT11]. It requires O(n log n)-integers space and reports
the d dominant/visible points in time O(log^2 n + d). Updates take O(log^2 n)
time. They achieve better complexities for simpler variants of the problem,
such as some 3-sided variants.



Our results build on the wavelet tree [GGV03], a succinct-space variant of a
classical structure by Chazelle [Cha88]. The wavelet tree has been used to
handle various geometric problems, e.g., [MN07, GPT09, BHMM09, BLNS10, BCN10,
GNP11]. We show in Section 3 how to use wavelet trees to solve dominance and
visibility problems using n + o(n) integers space and O((d + 1) log n) time.
The dynamic version also uses succinct space and requires
O((d + 1) log^2 n / log log n) time, carrying out updates in time
O(log^2 n / log log n). Compared to the best current result [BT11], our
structure requires succinct space instead of O(n log n) integers, offers better
update time, and has a comparable query time (being faster for small
d = O(log log n)).

The paper then considers a wide range of queries we call "statistical": The
points have an associated value in [0, W) = [0, W − 1] and, given a rectangle Q,
we consider the following queries:

Sum/Avg/Var: The sum/average/variance of the values in Q (Section 4).
Min/Max: The minimum/maximum value in Q (Section 5).
Quantile: The k-th smallest value in Q (Section 7).
Majority(α): The values appearing with relative frequency > α in Q (Sections 6
and 8).
Succ/Pred: The successor/predecessor of a value w in Q (Section 9).

These operations enable data-analysis queries such as "the average salary
of employees whose annual production is between x_0 and x_1 and whose age is
between y_0 and y_1". The minimum operation can be used to determine "the
employee with the lowest salary", in the previous conditions. The α-majority
operation can be used to compute "which are frequent (> 20%) salaries".
Quantile queries enable us to determine "which are the 10% highest salaries".
Successor queries can be used to find the smallest salary over $100,000 among
those employees.

Other applications for such queries are frequently found in Geographic In- 
formation Systems (GIS), where the points have a geometric interpretation and 
the values can be city sizes, industrial production, topographic heights, and so 
on. Yet another application comes from Bioinformatics, where two-dimensional 
points with intensities are obtained from DNA microarrays, and various kinds
of data-analysis activities are carried out on them. See Rahul et al. [RGJR11]
for an ample discussion on some of these applications and several others.

A popular set of statistical queries includes range sums, averages, and
maxima/minima. Willard [Wil85] solved two-dimensional range-sum queries on
finite groups within O(n log n)-integers space and O(log n) time. This includes
Sum and is easily extended to Avg and Var. Alstrup et al. [ABR00] obtained
the same complexities for the semigroup model, which includes Min/Max. The
latter can also be solved in constant time using O(n^2)-integers space [AFL07,
BDR11]. Chazelle [Cha88] showed how to reduce the space to O(n/ε) integers
and achieve time O(log^{2+ε} n) on semigroups, and O(log^{1+ε} n) for the
particular case of Min/Max.

On this set of queries our contribution is to achieve good time complexities 
within linear and even succinct space. This is relevant to handle large datasets 
in main memory. While the times we achieve are not competitive when using 
O(n) integers or more, we manage to achieve polylogarithmic times within just
n log n + o(n log n) bits on top of the bare coordinates and values, which we show
is close to the information-theoretic minimum space necessary to represent n 
points. 

As explained, we use wavelet trees. These store bit vectors at the nodes of 
a tree that decomposes the y-space of the grid. The vectors track down the 
points, sorted on top by x-coordinate and on the bottom by y-coordinate. We
enrich wavelet trees with extra data aligned to the bit vectors, which speeds up 
the computation of the statistical queries. Space is then reduced by sparsifying 
these extra data. 

We also focus on more sophisticated queries, for which fewer results exist, 
such as quantile, majority, and predecessor/successor queries. 

In one dimension, the best result we know of for quantile queries is a
linear-space structure by Brodal and Jørgensen [BJ09], which finds the k-th
element of any range in an array of length n in time O(log n / log log n),
which is optimal.

Operation             | Space per point (bits)   | Time                                 | Time in linear space | Source
Sum, Avg, Var         | log n (1 + 1/t)          | O(min(t log W, log n) t log W log n) | O(log^3 n)           | Thm. 6
Min, Max              | log n (1 + 1/t)          | O(min(t log m, log n) t log n)       | O(log^2 n)           | Thm. 8
Majority(α), fixed    | log n (2 + 1/t) + log m  | O(t log m log^2 n)                   | O(log^3 n)           | Thm. 10
Quantile              | log n log_ℓ m + O(log m) | O(ℓ log n log_ℓ m)                   | O(n^ε)               | Thm. 11
Majority(α), variable | log n log_ℓ m + O(log m) | O((1/α) ℓ log n log_ℓ m)             | O(n^ε)               | Thm. 12
Succ, Pred            | log n log_ℓ m + O(log m) | O(ℓ log n log_ℓ m)                   | O(n^ε)               | Thm. 13

Table 1: Our static results on statistical queries, for n two-dimensional
points with associated values in [0, W); m = min(n, W); 2 ≤ ℓ ≤ m and t ≥ 1 are
parameters. The space omits the mapping of the real (x, y) coordinates to the
space [0, n), as well as the storage of the point values. The 5th column gives
simplified time assuming log n = O(log W), any constant ε, and use of
O(n log n) bits.



An α-majority of a range Q is a value that occurs more than α · Count(Q)
times inside Q, for some α ∈ [0, 1]. The α-majority problem was previously
considered in one and two dimensions [KN08, DHM+11, GHMN11]. Durocher et al.
[DHM+11] solve one-dimensional α-range majority queries in time O(1/α) using
O(n(1 + log(1/α))) integers. Here α must be chosen when creating the data
structure. A more recent result, given by Gagie et al. [GHMN11], obtains a
structure of O(n^2 (H + 1) log(1/α)) bits for a dense n × n matrix (i.e., every
position contains an element), where H is the entropy of the distribution of
elements. In this case α is also chosen at indexing time, and the structure can
answer queries for any β ≥ α. The resulting elements are not guaranteed to
be β-majorities, as the list may contain false positives, but there are no
false negatives.

Other related queries have been studied in two dimensions. Rahul et al.
[RGJR11] considered a variant of Quantile where one reports the top-k
smallest/largest values in a range. They obtain O(n log^2 n)-integers space and
O(log n + k log log n) time. Navarro and Nekrich reduced the space to O(n/ε)
integers, with time O(log^{1+ε} n + k log^ε n). Durocher and Morrison [DM11]
consider the mode (most repeated value) in a two-dimensional range. Their
times are sublinear but super-polylogarithmic by far.

Our contribution in this case is a data structure of O(n log n) integers able
to solve the three basic queries in time O(log^2 n). The space can be stretched
up to linear, but at this point the times grow to the form O(n^ε). Our solution
for range majorities lets α be specified at query time. For the case of α known
at indexing time, we introduce a new linear-space data structure that answers
queries in time O(log^3 n), and up to O(log^2 n) when using O(n log n) integers
space.

In this case we build a wavelet tree on the universe of the point values. A 
sub-grid at each node stores the points whose values are within a range. With
this structure we can also solve mode and top-k most-frequent queries. 

Table 1 shows the time and space results we obtain in this article for
statistical queries. Several of our data structures can be made dynamic at the
price of a sublogarithmic penalty factor in the time complexities, as
summarized in Table 2.

Operation         | Space per point (bits)   | Query time                                                   | Update time
Sum, Avg, Var     | log U (1 + o(1) + 1/t)   | O(log U log n (1 + min(t log W, log U) t log W / log log n)) | O(log U log n)
Min, Max          | log U (1 + o(1) + 1/t)   | [illegible in the source]                                    | O(log U (log n + t log W))
Quantile          | log U log_ℓ W (1 + o(1)) | O(ℓ log U log n log_ℓ W / log log n)                         | O(log U log n log_ℓ W / log log n)
Majority(α), var. | log U log_ℓ W (1 + o(1)) | O((1/α) ℓ log U log n log_ℓ W / log log n)                   | O(log U log n log_ℓ W / log log n)
Succ, Pred        | log U log_ℓ W (1 + o(1)) | O(ℓ log U log n log_ℓ W / log log n)                         | O(log U log n log_ℓ W / log log n)

Table 2: Our dynamic results on statistical queries, for n two-dimensional
points on a U × U grid with associated values in [0, W); 2 ≤ ℓ ≤ W, t ≥ 1 and
0 < ε < 1 are parameters. The space omits the mapping of the real x coordinates
to [0, n), as well as the point values. The first line is proved in Thm. 7, the
second in Thm. 9, and the rest in Thm. 14.

2. Wavelet Trees 



Wavelet trees [GGV03] are defined on top of the basic Rank and Select
functions. Let B denote a bitmap, i.e., a sequence of 0's and 1's. Rank(B, b, i)
counts the number of times bit b ∈ {0, 1} appears in B[0, i], assuming
Rank(B, b, −1) = 0. The dual operation, Select(B, b, i), returns the position of
the i-th occurrence of b, assuming Select(B, b, 0) = −1.
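The two primitives can be illustrated with a minimal Python sketch. It scans the bitmap linearly, so it only mirrors the definitions above and not the constant-time succinct structures the paper relies on; the function names `rank` and `select` and the list-of-ints bitmap are assumptions made for illustration.

```python
def rank(B, b, i):
    """Number of occurrences of bit b in B[0..i] (inclusive); 0 when i == -1."""
    return sum(1 for bit in B[: i + 1] if bit == b)


def select(B, b, i):
    """Position of the i-th occurrence of b (counting from 1); -1 when i == 0."""
    if i == 0:
        return -1
    seen = 0
    for pos, bit in enumerate(B):
        if bit == b:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("B contains fewer than i occurrences of b")
```

Under these conventions the two operations are dual: `rank(B, b, select(B, b, k))` equals `k` whenever the k-th occurrence exists.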

The wavelet tree represents a sequence S[0, n) over alphabet Σ = [0, σ), and
supports access to any S[i], as well as Rank and Select on S, by reducing
them to bitmaps. It is a complete binary tree where each node v may have a
left child labeled 0 (called the 0-child of v) and a right child labeled 1
(called the 1-child). The sequence of labels obtained when traversing the tree
from the Root down to a node v is the binary label of v and is denoted L(v).
Likewise we denote V(L) the node that is obtained by following the sequence of
bits L, thus V(L(v)) = v. The binary labels of the leaves correspond to the
binary representation of the symbols of Σ. Given c ∈ Σ we denote by V(c) the
leaf that corresponds to symbol c. By c{..d} we denote the sequence of the
first d bits in c. Therefore, for increasing values of d, the V(c{..d}) nodes
represent the path to V(c).

Each node v represents (but does not store) the subsequence S(v) of S formed
by the symbols whose binary code starts with L(v). At each node v we only store
a (possibly empty) bitmap, denoted B(v), of length |S(v)|, so that B(v)[i] = 0
iff S(v)[i]{..d} = L(v) · 0, where d = |L(v)| + 1, that is, if S(v)[i] also
belongs to the 0-child. A bit position i in B(v) can be mapped to a position in
each of its child nodes: we map i to position R(v, b, i) = Rank(B(v), b, i) − 1
of the b-child. We refer to this procedure as the reduction of i, and use the
same notation to represent a sequence of steps, where b is replaced by a
sequence of bits. Thus R(Root, c, i), for a symbol c ∈ Σ, represents the
reduction of i from the Root using the bits in the binary representation of c.
With this notation we describe the way in which the wavelet tree computes Rank,
which is summarized by the equation Rank(S, c, i) = R(Root, c, i) + 1. We use a
similar notation R(v, v', i) to represent descending from node v towards a
given node v', instead of explicitly describing the sequence of bits b such
that L(v') = L(v) · b and writing R(v, b, i).

An important path in the tree is obtained by choosing R(v, B(v)[i], i) at each
node, i.e., at each node we decide to go left or right depending on the bit we
are currently tracking. The resulting leaf is V(S[i]); therefore this process
provides a way to obtain the elements of S. The resulting position is
R(Root, S[i], i) = Rank(S, S[i], i) − 1.

It is also possible to move upwards on the tree, reverting the process
computed by R. Let node v be the b-child of v'. Then, if i is a bit position in
B(v), we define the position Z(v, v', i), in B(v'), as Select(B(v'), b, i + 1).
In general, when v' is an ancestor of v, the notation Z(v, v', i) represents the
iteration of this process. For a general sequence, Select can be computed by
this process, as summarized by the equation
Select(S, c, i) = Z(V(c), Root, i − 1).
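The reduction R and its inverse Z can be sketched as follows. This is a minimal illustration under stated assumptions, not the succinct representation of Lemma 1: bitmaps are plain Python lists keyed by the node label L(v), rank and select on them take linear time, and the class and method names (`WaveletTree`, `reduce`, `access`) are hypothetical.

```python
def _rank(bitmap, b, i):
    # Occurrences of b in bitmap[0..i]; 0 when i == -1.
    return sum(1 for x in bitmap[: i + 1] if x == b)


def _select(bitmap, b, k):
    # Position of the k-th occurrence of b, k >= 1.
    seen = 0
    for pos, x in enumerate(bitmap):
        if x == b:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("not enough occurrences")


class WaveletTree:
    def __init__(self, S, sigma):
        self.bits = max(1, (sigma - 1).bit_length())  # bits per symbol
        self.B = {}                                   # label L(v) -> bitmap B(v)
        self._build(list(S), ())

    def _build(self, seq, label):
        if len(label) == self.bits or not seq:
            return
        shift = self.bits - len(label) - 1
        bitmap = [(c >> shift) & 1 for c in seq]
        self.B[label] = bitmap
        for b in (0, 1):
            self._build([c for c, x in zip(seq, bitmap) if x == b], label + (b,))

    def _bit(self, c, d):
        return (c >> (self.bits - d - 1)) & 1          # d-th bit of c from the top

    def reduce(self, c, i):
        """R(Root, c, i): track position i down the path towards leaf V(c)."""
        label = ()
        for d in range(self.bits):
            b = self._bit(c, d)
            i = _rank(self.B.get(label, []), b, i) - 1
            label += (b,)
        return i

    def rank(self, c, i):
        return self.reduce(c, i) + 1                   # Rank(S, c, i)

    def access(self, i):
        """S[i], obtained by following the bit B(v)[i] at each node."""
        label, c = (), 0
        for _ in range(self.bits):
            b = self.B[label][i]
            i = _rank(self.B[label], b, i) - 1
            c, label = (c << 1) | b, label + (b,)
        return c

    def select(self, c, k):
        """Select(S, c, k) = Z(V(c), Root, k - 1), moving upwards from the leaf."""
        i = k - 1
        for d in range(self.bits - 1, -1, -1):
            label = tuple(self._bit(c, j) for j in range(d))
            i = _select(self.B[label], self._bit(c, d), i + 1)
        return i
```

For example, with `S = [3, 1, 0, 2, 1, 3, 1]` and `sigma = 4`, `rank(1, 4)` counts the two occurrences of symbol 1 in `S[0..4]`, and `select(1, 3)` walks back up to position 6.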



Lemma 1. The wavelet tree for a sequence S[0, n) over alphabet Σ = [0, σ)
requires at most n log σ + o(n) bits of space.¹ It solves Rank, Select, and
access to any S[i] in time O(log σ).

Proof. Grossi et al. [GGV03] proposed a representation using
n log σ + O(n log σ log log n / log n) + O(σ log n) bits. Mäkinen and Navarro
showed how to use only one pointer per level, reducing the last term to
O(log σ log n) = O(log^2 n) = o(n). Finally, Golynski et al. [GGG+07] showed
how to support binary Rank and Select in constant time, while reducing the
redundancy of the bitmaps to O(n log log n / log^2 n), which added over the
n log σ bits gives o(n) as well. □

2.1. Representation of Grids 

Consider a set P of n distinct two-dimensional points (x, y) from a universe
[0, U) × [0, U). We map coordinates to rank space using a standard method
[Cha88, ABR00]: We store two sorted arrays X and Y with all the (possibly
repeated) x and y coordinates, respectively. Then we convert any point (x, y)
into rank space [0, n) × [0, n) in time O(log n) using two binary searches.
Range queries are also mapped to rank space via binary searches (in an
inclusive manner in case of repeated values). This mapping time will be
dominated by other query times.
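The mapping to rank space can be sketched with Python's `bisect` module. This is an illustrative assumption about how the sorted arrays X and Y might be used, with ties broken by the other coordinate; the helper names `to_rank_space` and `map_range` are hypothetical, and the points are assumed to be distinct tuples.

```python
from bisect import bisect_left, bisect_right


def to_rank_space(points):
    """Return the sorted arrays X and Y, and S with S[i] = y-rank of the point of x-rank i."""
    by_x = sorted(points, key=lambda p: (p[0], p[1]))  # ties in x broken by y
    by_y = sorted(points, key=lambda p: (p[1], p[0]))  # ties in y broken by x
    X = [p[0] for p in by_x]
    Y = [p[1] for p in by_y]
    y_rank = {p: r for r, p in enumerate(by_y)}
    S = [y_rank[p] for p in by_x]
    return X, Y, S


def map_range(A, lo, hi):
    """Inclusive rank range covering every coordinate in [lo, hi]; empty if first > last."""
    return bisect_left(A, lo), bisect_right(A, hi) - 1
```

Using `bisect_left` for the lower end and `bisect_right` for the upper end is what makes the mapped range inclusive when coordinates repeat.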

Therefore we store the points of P on a [0, n) × [0, n) grid, with exactly one
point per row and one per column. We regard this set as a sequence S[0, n) and
the grid is formed by the points (i, S[i]). Then we represent S using a wavelet
tree.

The space of X and Y corresponds to the bare point data and will not be
further mentioned; we will only count the space to store the points in rank
space, as usual in the literature. In Appendix A we show how we can represent
this mapping into rank space so that, together with a wavelet tree
representation, the total space is only O(n) bits over the minimum given by
information theory.

¹ From now on the space will be measured in bits and log will be to the base 2.

The information relative to a point p_0 = (x_0, y_0) is usually tracked from
the Root and denoted R(Root, y_0{..d}, x_0). A pair of points p_0 = (x_0, y_0)
and p_1 = (x_1, y_1), where x_0 ≤ x_1 and y_0 ≤ y_1, defines a rectangle; this
is the typical query range we consider in this paper. Rectangles have an
implicit representation in wavelet trees, spanning O(log n) nodes [MN07]. The
binary representations of y_0 and y_1 share a (possibly empty) common prefix.
Therefore the paths V(y_0{..d}) and V(y_1{..d}) have a common initial path and
then split at some node of depth k, i.e., V(y_0{..d}) = V(y_1{..d}) for d ≤ k
and V(y_0{..d'}) ≠ V(y_1{..d'}) for d' > k. Geometrically,
V(y_0{..k}) = V(y_1{..k}) corresponds to the smallest horizontal band of the
form [j · n/2^k, (j + 1) · n/2^k) that contains the query rectangle Q, for an
integer j. For d' > k the nodes V(y_0{..d'}) and V(y_1{..d'}) correspond
respectively to successively thinner, non-overlapping bands that contain the
coordinates y_0 and y_1.

Given a rectangle Q = [x_0, x_1] × [y_0, y_1] we consider, for depths d > k
(that is, below the split), the nodes V(y_0{..d} · 1) such that
y_0{..d} · 1 ≠ y_0{..d + 1}, and the nodes V(y_1{..d} · 0) such that
y_1{..d} · 0 ≠ y_1{..d + 1}. These nodes, together with V(y_0) and V(y_1), form
the implicit representation of [y_0, y_1], denoted imp(y_0, y_1). The size of
this set is O(log n). Let us recall a well-known application of this
decomposition.

Lemma 2. Given n two-dimensional points, the number of points inside a query
rectangle Q = [x_0, x_1] × [y_0, y_1], Count(Q), can be computed in time
O(log n) with a structure that requires n log n + o(n) bits.

Proof. The result is Σ_{v ∈ imp(y_0, y_1)} R(Root, v, x_1) − R(Root, v, x_0 − 1).
Notice that all the values R(Root, y_0{..d}, x) and R(Root, y_1{..d}, x) can be
computed sequentially, in total time O(log n), for x = x_1 and x = x_0 − 1. For
a node v ∈ imp(y_0, y_1) the desired difference can be computed from one of
these values in time O(1). Then the lemma follows. □
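The counting procedure of Lemma 2 can be sketched in Python as a top-down recursion that stops exactly at the nodes whose band lies inside [y_0, y_1], which play the role of imp(y_0, y_1). The sketch makes several simplifying assumptions: bitmaps are plain lists keyed by node label, prefix counts are recomputed by summation instead of constant-time Rank, and intervals are kept half-open rather than in the R(Root, v, x) notation. The names `build_bitmaps` and `count` are hypothetical.

```python
def build_bitmaps(S, bits):
    """Wavelet-tree bitmaps of S, keyed by node label (tuple of bits from the root)."""
    B = {}

    def rec(seq, label):
        if len(label) == bits or not seq:
            return
        shift = bits - len(label) - 1
        B[label] = [(y >> shift) & 1 for y in seq]
        for b in (0, 1):
            rec([y for y in seq if (y >> shift) & 1 == b], label + (b,))

    rec(list(S), ())
    return B


def count(B, bits, x0, x1, y0, y1):
    """Number of positions i in [x0, x1] with y0 <= S[i] <= y1."""

    def rec(label, lo, hi, ylo, yhi):
        # [lo, hi) is the x-interval mapped to this node; [ylo, yhi] is its y-band.
        if lo >= hi or yhi < y0 or ylo > y1:
            return 0
        if y0 <= ylo and yhi <= y1:
            return hi - lo                      # node of the implicit representation
        bitmap = B[label]
        ones_lo, ones_hi = sum(bitmap[:lo]), sum(bitmap[:hi])
        mid = (ylo + yhi) // 2
        return (rec(label + (0,), lo - ones_lo, hi - ones_hi, ylo, mid)
                + rec(label + (1,), ones_lo, ones_hi, mid + 1, yhi))

    return rec((), x0, x1 + 1, 0, (1 << bits) - 1)
```

Only the two root-to-leaf paths of y_0 and y_1 are expanded, so the recursion visits O(log n) nodes; each node fully inside the band contributes its interval length in one step.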

This is not the best possible result for this problem (a better result by Bose
et al. [BHMM09] exists), but it is useful to illustrate how wavelet trees solve
range search problems.



3. Geometric Queries 

In this section we use wavelet trees to solve, within succinct space, two 
geometric problems of relevance for data analysis. 



3.1. Dominating Points 

Given points p_0 = (x_0, y_0) and p_1 = (x_1, y_1), we say that p_0 dominates
p_1 if x_0 ≥ x_1 and y_0 ≥ y_1. Note that one point can dominate the other even
if they coincide in one coordinate. Therefore, for technical convenience, in
the reduction described in Section 2.1, points with the same y coordinates must
be ranked in Y by increasing x value, and points with the same x coordinates
must be ranked in X by increasing y value. A point is dominant inside a range
if there is no other point in that range that dominates it. In Fig. 1 we define
the cardinal points N, S, E, W, SW, etc. We first use wavelet trees to
determine dominant points within rectangles.

Theorem 3. Given n two-dimensional points, the d dominating points inside a
rectangle Q = [x_0, x_1] × [y_0, y_1] can be obtained (in NW to SE order) in
time O((d + 1) log n), with a data structure using n log n + o(n) bits.



Proof. Let v ∈ imp(y_0, y_1) be nodes in the implicit representation of
[y_0, y_1]. We perform depth-first searches (DFS) rooted at each
v ∈ imp(y_0, y_1), starting from V(y_1) and continuing sequentially to the left
until V(y_0). Each such DFS is computed by first visiting the 1-child and then
the 0-child. As a result, we will find the points in N to S order.

We first describe a DFS that reports all the points, and then restrict it to
the dominant ones. Each visited node v' tracks the interval
(R(Root, v', x_0 − 1), R(Root, v', x_1)]. If the interval is empty we skip the
subtree below v'. As the grid contains only one point per row, for leaves
v' = v^(i) the intervals (R(Root, v^(i), x_0 − 1), R(Root, v^(i), x_1)] contain
at most one value, corresponding to a point p^(i) ∈ P ∩ Q. Then
x^(i) = Z(v^(i), Root, R(Root, v^(i), x_1)) and p^(i) = (x^(i), S[x^(i)]).
Reporting k points in this way takes O((k + 1) log n) time.

By restricting the intervals associated with the nodes we obtain only dominant
points. In general, let v' be the current node, that is either in
imp(y_0, y_1) or a descendant of a node in imp(y_0, y_1), and let
(x^(i), S[x^(i)]) be the last point that was reported. Instead of considering
the interval (R(Root, v', x_0 − 1), R(Root, v', x_1)], consider the interval
(R(Root, v', x^(i)), R(Root, v', x_1)]. This is correct as it eliminates points
(x, y) with x < x^(i), and also y < S[x^(i)], given the N to S order in which
we deliver the points.

As explained, a node with an empty interval is skipped. On the other hand, if
the interval is non-empty, it must produce at least one dominant point. Hence
the cost of reporting the d dominant points amortizes to O((d + 1) log n). □
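The pruning rule of the proof (deliver points from N to S and shrink the x-interval to the right of the last reported point) can be illustrated on plain arrays. The sketch below is not the wavelet tree algorithm of Theorem 3 and does not achieve its O((d + 1) log n) time: it scans every row of the query band. It assumes a rank-space sequence S with one point per row and column; the function name `dominant_points` is hypothetical.

```python
def dominant_points(S, x0, x1, y0, y1):
    """Points (x, S[x]) of [x0, x1] x [y0, y1] that no other point of the range
    dominates, listed from N to S (equivalently, NW to SE)."""
    x_of = {y: x for x, y in enumerate(S)}   # one point per row in rank space
    result = []
    x_min = x0                               # left end of the still-useful x-interval
    for y in range(y1, y0 - 1, -1):          # visit rows from N to S
        x = x_of.get(y)
        if x is not None and x_min <= x <= x1:
            result.append((x, y))
            x_min = x + 1                    # lower points must lie strictly to the E
    return result
```

Every point skipped because `x < x_min` lies below and to the left of the last reported point, so it is dominated, which is the same argument used for the restricted intervals in the proof.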




Figure 1: Dominance on wavelet tree coordinates. The grayed points dominate all
the others in the rectangle. We also show the 4 directions.

Figure 2: Rectangle visibility. For SW visibility the problem is the same as
dominance. We grayed the points that are visible in the other 3 directions.

3.2. Rectangle Visibility

Rectangle visibility is another, closely related, geometric problem. A point
p ∈ P is visible from a point q = (x_0, y_0), not necessarily in P, if the
rectangle defined by p and q as diagonally opposite corners does not contain
any point of P. Depending on the direction from q to p the visibility is called
SW, SE, NW or NE (see Fig. 2). Next we solve visibility as a variant of
dominance.

Theorem 4. The structure of Thm. 3 can compute the d points that are visible
from a query point q = (x_0, y_0), in order, in time O((d + 1) log n).

Proof. Note that SW visibility corresponds precisely to determining the
dominant points of the region [0, x_0] × [0, y_0]. Hence we use the procedure
in Thm. 3. We now adapt it to the three remaining directions without
replicating the structure.

For NE or SE visibility, we change the definition of operation R to
R(v, b, i) = Rank(B(v), b, i − 1) + 1, thus Rank(S, c, i) = R(Root, c, i) + 1 if
S[i] = c and Rank(S, c, i) = R(Root, c, i) otherwise. In this case we track the
intervals [R(Root, v', x_0), R(Root, v', x_1 + 1)). This R(Root, v', x_1 + 1) is
replaced by R(Root, v', x^(i)) when restricting the search with the last
dominant point.

For NE or NW visibility, the DFS searches first visit the 0-child and then use
the resulting points to restrict the search on the visit to the 1-child;
moreover, they first visit the node V(y_0) and move to the right.

Finally, for NW or SE visibility our point ordering in presence of ties in X
or Y may report points with the same x or y coordinate. To avoid this we
detect ties in X or Y at the time of reporting, right after determining the
pair p^(i) = (x, y) = (x^(i), S[x^(i)]). In the NW (SE) case, we binary search
for the last (first) positions x' such that X[x'] = X[x] and y' such that
Y[y'] = Y[y]. Then we correct p^(i) to (x', y) (to (x, y')). The subsequent
searches are then limited by x' instead of x = x^(i). We also limit subsequent
searches in a new way: we skip traversing subsequent subtrees of imp(y_0, y_1)
until the y values are larger (NW) or smaller (SE) than y'. Still the cost per
reported point is O(log n). □

3.3. Dynamism 

We can support point insertions and deletions on a fixed U × U grid. Dynamic
variants of the bitmaps stored at each wavelet tree node raise the extra space
to o(log U) per point and multiply the times by O(log n / log log n)
[HM10, NS10].

Lemma 5. Given n points on a U × U grid, there is a structure using
n log U + o(n log U) bits, answering queries in time
O(t(log U) log n / log log n), where t(h) is the time complexity of the query
using static wavelet trees of height h. It handles insertions and deletions in
time O(log U log n / log log n).

Proof. We use the same data structure and query algorithms of the static
wavelet trees described in Section 2.1, yet representing their bitmaps with the
dynamic variants [HM10, NS10]. We also maintain vector X, but not Y; we use
the y-coordinates directly instead since the wavelet tree handles repetitions
in y. Having a wavelet tree of depth log U makes the time t(log U), whereas
using dynamic bitmaps multiplies this time by O(log n / log log n).

Instead of an array, we use for X a B-tree with arity O(log U). Nodes are
handled with a standard technique for managing cells of different sizes
[Mun86], which wastes just O(log U) bits in total. As a result, the time for
accessing a position of X or for finding the range of elements corresponding
to a range of coordinates is O(log U), which is subsumed by other complexities.
The extra space on top of that of the bare coordinates is O(n + log^2 U) bits.
This is o(n log U) unless n = o(log U), in which case we can just store the
points in plain form and solve all queries sequentially. It is also easy to
store differentially encoded coordinates in this B-tree to reduce the space of
mapping the universe of X coordinates to the same achieved in Section 2.1.

When inserting a new point (x, y), apart from inserting x into X, we track
the point downwards in the wavelet tree, doing the insertion at each of the
log U bitmaps. Deletion is analogous. □

As a direct application, dominance and visibility queries can be solved in
the dynamic setting in time O((d + 1) log U log n / log log n), while supporting
point insertions and deletions in time O(log U log n / log log n). The only
issue is that we now may have several points with the same y-coordinate, which
is noted when the interval is of size more than one upon reaching a leaf. In
this case, as these points are sorted by increasing x-coordinate, we only
report the first one (E) or the last one (W). Ties in the x-coordinate,
instead, are handled as in the static case.



4. Range Sum, Average, and Variance 

We now consider points with an associated value given by an integer function
w : P → [0, W). We define the sequence of values W(v) associated to each
wavelet tree node v as follows: If S(v) = p_0, p_1, ..., p_|S(v)|, then
W(v) = w(p_0), w(p_1), ..., w(p_|S(v)|). We start with a solution to several
range sum problems on groups. In our results we will omit the bare n⌈log W⌉
bits needed to store the values of the points.

Theorem 6. Given n two-dimensional points with associated values in [0, W),
the sum of the point values inside a query rectangle
Q = [x_0, x_1] × [y_0, y_1], Sum(Q), can be computed in time
O(min(t log W, log n) t log W log n), with a structure that requires
n log n (1 + 1/t) bits, for any t ≥ 1. It can also compute the average and
variance of the values, Avg(Q) and Var(Q) respectively.

Proof. We enrich the bitmaps of the wavelet tree for P. For each node v we
represent its vector W(v) = w(p_0), w(p_1), ..., w(p_|S(v)|) as a bitmap A(v),
where we concatenate the unary representation of the w(p_i)'s, i.e., w(p_i)
0's followed by a 1. These bitmaps A(v) are represented in a compressed format
[OS07] that requires at most |S(v)| log W + O(|S(v)|) bits. With this structure
we can determine the sum w(p_0) + w(p_1) + ... + w(p_i), i.e., the partial sums,
in constant time by means of Select(A(v), 1, i) queries:²
Wsum(v, i) = Select(A(v), 1, i + 1) − i is the sum of the first i + 1 values. In
order to compute Sum(Q) we use a formula similar to the one of Lemma 2:

Σ_{v ∈ imp(y_0, y_1)} Wsum(v, R(Root, v, x_1)) − Wsum(v, R(Root, v, x_0 − 1)).   (1)

To obtain the tradeoff related to t, we call τ = t log W and store only every
τ-th entry in A, that is, we store partial sums only at the end of blocks of
τ entries of W(v). We lose our ability to compute Wsum(v, i) exactly, but
can only compute it for i values that are at the end of blocks,
Wsum(v, τ · i) = Select(A(v), 1, i + 1) − i. To compute each of the terms in the
sum of Eq. (1) we can use
Wsum(v, τ · ⌊R(Root, v, x_1)/τ⌋) − Wsum(v, τ · ⌈R(Root, v, x_0 − 1)/τ⌉)
to find the sum of the part of the range that covers whole blocks. Then we must
find out the remaining (at most) 2τ − 2 values w(p_i) that lie at both extremes
of the range, to complete the sum.
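The block-sampling idea can be sketched as follows. This is an illustration under simplifying assumptions, not the structure of the proof: the sampled sums are stored in a plain list instead of the compressed bitmap A(v), the values W(v) are assumed to be directly available at the node (in the paper they may have to be tracked down the tree), and ranges are half-open. The class name `SampledPrefixSums` is hypothetical.

```python
class SampledPrefixSums:
    def __init__(self, values, tau):
        self.values = list(values)   # stand-in for W(v)
        self.tau = tau
        self.block_sums = [0]        # block_sums[j] = sum of the first j * tau values
        for end in range(tau, len(self.values) + 1, tau):
            block = self.values[end - tau:end]
            self.block_sums.append(self.block_sums[-1] + sum(block))

    def range_sum(self, lo, hi):
        """Sum of values[lo:hi], using sampled sums for the whole blocks inside."""
        first = -(-lo // self.tau)   # ceil(lo / tau): first block boundary >= lo
        last = hi // self.tau        # floor(hi / tau): last block boundary <= hi
        if first >= last:
            return sum(self.values[lo:hi])          # no whole block in the range
        middle = self.block_sums[last] - self.block_sums[first]
        left = sum(self.values[lo:first * self.tau])    # fewer than tau values
        right = sum(self.values[last * self.tau:hi])    # fewer than tau values
        return middle + left + right
```

Only the two partial blocks at the extremes are scanned value by value, which corresponds to the at most 2τ − 2 remaining values mentioned in the proof.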

In order to find out those values, we store the vectors W(v) explicitly at all
the tree nodes v whose height h(v) is a multiple of τ, including the leaves. If
a node v ∈ imp(y_0, y_1) does not have stored its vector W(v), it can still
compute any w(p_i) value by tracking it down for at most τ levels.

As a result, the time to compute a Sum(Q) query is O(τ^2 log n), yet it is
limited to O(τ log^2 n) if τ > log n, as at worst we have W(v) represented at
the leaves. The space for the A(v) vectors is at most
(|S(v)|/τ)(log W + O(1)) bits, which adds up to n log n (log W + O(1))/τ bits.
On the other hand, the W(v) vectors add up to
n log n (log W)/τ = n (log n)/t bits. This holds for any t even when we store
always the W(v) vectors at the leaves: The space of those W(v) is not counted
because we take for free the space needed to represent all the values once, as
explained.

The average inside Q is computed as Avg(Q) = Sum(Q)/Count(Q), where the latter
is computed with the same structure by just adding up the interval lengths in
imp(y_0, y_1). To compute variance we use, conceptually, an additional instance
of the same data structure, with values w'(p) = w^2(p). Then
Var(Q) = Sum'(Q)/Count(Q) − (Sum(Q)/Count(Q))^2, where Sum' uses the values w'.
Note that in fact we only need to store additional (sampled) bitmaps A'(v)
corresponding to the partial sums of vectors W'(v) (these bitmaps may need
twice the space of the A(v) bitmaps as they handle values that fit in 2 log W
bits). Explicitly stored vectors W'(v) are not necessary as they can be
emulated with W(v), and we can also share the same wavelet tree structures and
bitmaps. This extra space fits within the same O(n (log n)/t) bits. □
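The way Avg(Q) and Var(Q) follow from the three aggregates Count(Q), Sum(Q) and Sum'(Q) can be written as a short Python sketch. The function name `avg_and_var` is hypothetical, and exact rational arithmetic via `fractions.Fraction` is an illustrative choice to avoid rounding, not something the paper prescribes.

```python
from fractions import Fraction


def avg_and_var(count, total, total_sq):
    """Avg(Q) and Var(Q) from Count(Q), Sum(Q) and Sum'(Q) (sum of squared values)."""
    if count == 0:
        raise ValueError("empty range")
    avg = Fraction(total, count)
    var = Fraction(total_sq, count) - avg * avg
    return avg, var
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the aggregates are Count = 8, Sum = 40 and Sum' = 232, giving average 5 and variance 29 − 25 = 4.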

Appendix B shows how to further reduce the constant hidden in the $O$
notation. This is important because this constant is also associated with the
$\lceil \log W \rceil$ bits of the weights, which are being omitted from the
analysis: in the case of $w'$ we have $2\lceil \log W \rceil$ bits per point.



* Using constant-time Select structures on their internal bitmap $H$ [OS07].

Finite groups. The solution applies to finite groups $(G, \oplus, {}^{-1}, 0)$.
We store $\mathrm{Wsum}(v, i) = w(p_0) \oplus w(p_1) \oplus \cdots \oplus w(p_i)$,
directly using $\lceil \log |G| \rceil$ bits per entry. The terms
$\mathrm{Wsum}(v, i) - \mathrm{Wsum}(v, j)$ of Eq. (1) are replaced by
$\mathrm{Wsum}(v, j)^{-1} \oplus \mathrm{Wsum}(v, i)$.
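To see why the inverse must be applied on the left, the following Python sketch uses a non-commutative group (permutations of three elements) as an example. The group choice and the names `op`, `inv`, `prefix_products` and `range_product` are assumptions made for this illustration only.

```python
from functools import reduce

def op(a, b):
    """Group operation on permutations stored as tuples: apply a, then b."""
    return tuple(b[x] for x in a)

def inv(a):
    """Inverse permutation."""
    out = [0] * len(a)
    for x, y in enumerate(a):
        out[y] = x
    return tuple(out)

def prefix_products(ws):
    """Wsum[i] = w[0] (+) w[1] (+) ... (+) w[i]."""
    out, acc = [], None
    for w in ws:
        acc = w if acc is None else op(acc, w)
        out.append(acc)
    return out

def range_product(wsum, j, i):
    """w[j+1] (+) ... (+) w[i] for j < i, computed as Wsum(j)^-1 (+) Wsum(i)."""
    return op(inv(wsum[j]), wsum[i])
```

Since $\mathrm{Wsum}(i) = \mathrm{Wsum}(j) \oplus (w(p_{j+1}) \oplus \cdots \oplus w(p_i))$, cancelling $\mathrm{Wsum}(j)$ from the left leaves exactly the range "sum", without requiring commutativity.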

Dynamism 

A dynamic variant is obtained by using the dynamic wavelet trees of Lemma 5,
a dynamic partial sums data structure instead of $A(v)$, and a dynamic array for
vectors $W(v)$.

Theorem 7. Given $n$ points on a $U \times U$ grid, with associated values in
$[0, W)$, there is a structure that uses $n \log U\,(1 + o(1) + 1/t)$ bits, for
any $t \ge 1$, that answers the queries in Theorem 6 in time
$O(\log U \log n\,(1 + \min(t \log W, \log U)\, t \log W / \log\log n))$, and
supports point insertions/deletions, and value updates, in time
$O(\log U \log n)$.

Proof. The algorithms on the wavelet tree bitmaps are carried out verbatim,
now on the dynamic data structures of Lemma 5, which add $o(n \log U)$ bits
of space overhead and multiply the times by $O(\log U/\log\log n)$. This adds
$O(\min(t \log W, \log U)\, t \log W \log U \log n/\log\log n)$ to the query
times.

Dynamic arrays to hold the explicit $W(v)$ vectors can be implemented within
$|W(v)| \log W\,(1 + o(1))$ bits, and provide access and indels in time
$O(\log n/\log\log n)$ [NS10, Lemma 1]. This adds
$O(t \log W \log n/\log\log n)$ to the query times, which is negligible.

For insertions we must insert the new bits at all the levels as in Lemma 5,
which costs $O(\log U \log n/\log\log n)$ time, and also insert the new values
in $W(v)$ at $1 + (\log U)/(t \log W)$ levels, which in turn costs time
$O((1 + (\log U)/(t \log W)) \log n/\log\log n)$ (this is negligible compared to
the cost of updating the bitmaps). Deletions are analogous. To update a value we
just delete and reinsert the point.



A structure for dynamic searchable partial sums [MN08] takes
$n \log W + o(n \log W)$ bits to store an array of $n$ values, and supports
partial sums, as well as insertions and deletions of values, in time
$O(\log n)$. Note that we carry out $O(\log U)$ partial sum operations per
query. We also perform $O(\log U)$ updates when points are inserted/deleted.
This adds $O(\log U \log n)$ time to both query and update complexities.
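For readers unfamiliar with logarithmic-time partial sums, the Python sketch below shows a Fenwick (binary indexed) tree. This is a simpler, swapped-in technique and not the cited structure: it supports value updates and prefix sums in $O(\log n)$ time, but not insertions or deletions of positions, and it is not space-compact. The class name `Fenwick` is hypothetical.

```python
class Fenwick:
    """Fenwick tree: prefix sums and point updates in O(log n) time each."""

    def __init__(self, values):
        self.n = len(values)
        self.tree = [0] * (self.n + 1)
        for i, v in enumerate(values):
            self.add(i, v)

    def add(self, i, delta):
        """Add delta to the value at 0-based position i."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Sum of the first i values."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s
```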

Maintaining the sampled partial sums $A(v)$ is the most complicated part.
Upon insertions and deletions we cannot maintain a fixed block size $\tau$.
Rather, we use a technique [GN08] that ensures that the blocks are of length at
most $2\tau$ and two consecutive blocks add up to at least $\tau$. This is
sufficient to ensure that our space and time complexities hold. The technique
does not split or merge blocks, but it just creates/removes empty blocks, and
moves one value to a neighboring block. All those operations are easily carried
out with a constant number of operations in the dynamic partial sums data
structure.

Finally, we need to mark the positions where the blocks start. We can maintain
the sequence of $O(|S(v)|/\tau)$ block lengths using again a partial sums data
structure [MN08], which takes $O((|S(v)|/\tau) \log \tau)$ bits. The starting
position of any block is obtained as a partial sum, in time $O(\log n)$, and the
updates required when blocks are created or change size are also carried out in
time $O(\log n)$. These are all within the same complexities of the partial sum
structure for $A(v)$. □

Finite groups and semigroups. The solution applies to finite groups
$(G, \oplus, {}^{-1}, 0)$. The dynamic structure for partial sums can be easily
converted into one that stores the local "sum"
$w(p_i) \oplus w(p_{i+1}) \oplus \cdots \oplus w(p_j)$ of each subtree
containing leaves $p_i, \ldots, p_j$. The only obstacle in applying it to
semigroups is that we cannot move an element from one block to another in
constant time, because we have to recalculate the "sum" of a block without the
element removed. This takes time $O(\tau)$, so the update time becomes
$O(\log U\,(\log n + t \log W))$.



5. Range Minima and Maxima

For the one-dimensional problem there exists a data structure using just
$2n + o(n)$ bits, which answers queries in constant time without accessing the
values [Fis10]. This structure allows for a better space/time tradeoff compared
to range sums.

For the queries that follow we do not need the exact $w(p)$ values, but just
their relative order. So we set up a bitmap $V[1, W]$ where the values occurring
in the set are marked. This bitmap can be stored within at most
$m \log(W/m) + O(m)$ bits [OS07], where $m \le \min(n, W)$ is the number of
unique values. This representation converts between actual value and relative
order in time $O(\log m)$, which will be negligible. This way, many complexities
will be expressed in terms of $m$.

Theorem 8. Given $n$ two-dimensional points with associated values in $[0, W)$,
the minimum of the point values inside a query rectangle
$Q = [x_0, x_1] \times [y_0, y_1]$, $\mathrm{Min}(Q)$, can be found in time
$O(\min(t \log m, \log n)\, t \log n)$, with a structure using
$n \log n\,(1 + 1/t)$ bits, for any $t \ge 1$ and $m = \min(n, W)$. The maximum
of the point values inside a query rectangle $Q$ can be found within the same
time and space bounds.



Proof. We associate to each node $v$ the one-dimensional data structure [Fis10]
corresponding to $W(v)$, which takes $2|W(v)| + o(|W(v)|)$ bits. This adds up to
$2n \log n + o(n \log n)$ bits overall. We call
$\mathrm{Wmin}(v, i, j) = \arg\min_{i \le s \le j} W(v)[s]$ the one-dimensional
operation. Then we can find in constant time the position of the minimum value
inside each $v \in \mathrm{imp}(y_0, y_1)$ (without the need to store the values
in the node), and the range minimum is

$$\min_{v \in \mathrm{imp}(y_0, y_1)} W(v)\bigl[\mathrm{Wmin}(v,\, R(\mathrm{Root}, v, x_0),\, R(\mathrm{Root}, v, x_1 + 1) - 1)\bigr].$$

To complete the comparison we need to compute the $O(\log n)$ values $W(v)[s]$
of different nodes $v$. By storing the $W(v)$ vectors of Theorem 6 (in the range
$[1, m]$) every $\tau = t \log m$ levels, the time is just
$O(\min(\tau, \log n) \log n)$ because we have to track down just one point for
each $v \in \mathrm{imp}(y_0, y_1)$. The space is
$3n \log n + (n \log n \log m)/\tau = 3n \log n + (n \log n)/t$ bits. The second
term holds for any $t$ even when we always store $n \log m$ bits at the leaves,
because adding these to the $m \log(W/m) + O(m)$ bits used for $V$, we have the
$n \lceil \log W \rceil$ bits corresponding to storing the bare values, which
are not accounted for in our space complexities.

To reduce the space further, we split $W(v)$ into blocks of length $r$ and
create a sequence $W'(v)$, of length $|S(v)|/r$, where we take the minimum of
each block,
$W'(v)[i] = \min\{W(v)[(i-1) \cdot r + 1], \ldots, W(v)[i \cdot r]\}$. The
one-dimensional data structures are built over $W'(v)$, not $W(v)$, and thus the
overall space for these is $O((n/r) \log n)$ bits. In exchange, to complete the
query we need to find the $r$ values covered by the block where the minimum was
found, plus up to $2r - 2$ values in the extremes of the range that are not
covered by full blocks. The time is thus $O(r \min(\tau, \log n) \log n)$. By
setting $r = t$ we obtain the result.

For $\mathrm{Max}(Q)$ we use analogous data structures. □
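The block-sampling step can be sketched in one dimension as follows. This is an illustration under stated assumptions, not the structure of the proof: the values are kept in a plain list, a linear scan over the block minima stands in for the constant-time structure of [Fis10], and the names `BlockMin` and `range_min` are hypothetical.

```python
class BlockMin:
    """Hypothetical 1-D sketch: range minimum with one sampled minimum per block."""

    def __init__(self, values, r):
        self.values, self.r = list(values), r
        # W'[i] = minimum of the i-th block of r consecutive values.
        self.block_min = [min(self.values[s:s + r])
                          for s in range(0, len(self.values), r)]

    def range_min(self, i, j):
        """Minimum of values[i..j], inclusive."""
        r = self.r
        first, last = -(-i // r), (j + 1) // r    # full blocks are first .. last-1
        if first >= last:                          # no full block inside the range
            return min(self.values[i:j + 1])
        # Stand-in for the constant-time structure: locate the winning block.
        b = min(range(first, last), key=self.block_min.__getitem__)
        candidates = self.values[b * r:(b + 1) * r]    # the r values of that block
        candidates += self.values[i:first * r]         # left extreme, fewer than r
        candidates += self.values[last * r:j + 1]      # right extreme, fewer than r
        return min(candidates)
```

The query inspects the $r$ values of one block plus at most $2r - 2$ values at the extremes, mirroring the cost analysis above.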

Top-k queries in succinct space. We can now solve the top-$k$ query of
Rahul et al. [RGJR11] by iterating over Theorem 8. Let us set $r = 1$. Once we
identify that the overall minimum is some $W(v)[s]$ from the range
$W(v)[i, j]$, we can find the second minimum among the other candidate ranges
plus the ranges $W(v)[i, s-1]$ and $W(v)[s+1, j]$. As this is repeated $k$
times, we pay time $O(\tau(k + \log n))$ to find all the minima. A priority
queue handling the ranges will perform $k$ minimum extractions and
$O(k + \log n)$ insertions, and its size will be limited to $k$. So the overall
time is $O(\tau \log n + k(\tau + \log k))$ by using a priority queue with
constant insertion time [CMP88]. Using $\tau = t \log m$ for any $t = \omega(1)$
we obtain time $O(t \log m \log n + k\, t \log m \log k)$ and
$n \log n + o(n \log n)$ bits of space. The best current linear-space solution
[NN12] achieves better time and linear space, but the constant multiplying the
linear space is far from 1.
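The range-splitting iteration can be sketched on a single array. In this simplified Python illustration there is one initial candidate range instead of the $O(\log n)$ ranges of a two-dimensional query, a linear scan stands in for the constant-time range-minimum structure, and a binary heap replaces the constant-insertion-time queue; the function name `top_k_smallest` is hypothetical.

```python
import heapq

def top_k_smallest(values, i, j, k):
    """Return the k smallest entries of values[i..j] in increasing order."""
    def argmin(lo, hi):
        # Stand-in for the constant-time range-minimum structure.
        return min(range(lo, hi + 1), key=values.__getitem__)

    heap, out = [], []
    if i <= j:
        s = argmin(i, j)
        heap.append((values[s], s, i, j))
    while heap and len(out) < k:
        val, s, lo, hi = heapq.heappop(heap)
        out.append(val)
        # Split the range around the reported position and queue both parts.
        for a, b in ((lo, s - 1), (s + 1, hi)):
            if a <= b:
                t = argmin(a, b)
                heapq.heappush(heap, (values[t], t, a, b))
    return out
```

Each extraction triggers at most two new range-minimum queries, so $k$ results cost $O(k)$ such queries beyond the initial ones.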



5.1. Dynamism 

We can directly apply the result on semigroups given at the end of Section 4.
Note that, while in the static scenario we achieve a better result than for
sums, in the dynamic case the result is slightly worse.

Theorem 9. Given $n$ points on a $U \times U$ grid, with associated values in
$[0, W)$, there is a structure using $n \log U\,(1 + o(1) + 1/t)$ bits, for any
$t \ge 1$, that answers the queries in Theorem 8 in time
$O(\log U \log n\,(1 + \min(t \log W, \log U)\, t \log W / \log\log n))$, and
supports point insertions/deletions, and value updates, in time
$O(\log U\,(\log n + t \log W))$.



6. Range Majority for Fixed $\alpha$

In this section we describe a data structure that answers $\alpha$-majority
queries for the case where $\alpha$ is fixed at construction time. Again, we
enrich the wavelet tree with additional information that is sparsified. We
obtain the following result.

Theorem 10. Given $n$ two-dimensional points with associated values in $[0, W)$
and a fixed value $0 < \alpha < 1$, all the $\alpha$-majorities inside a query
rectangle $Q = [x_0, x_1] \times [y_0, y_1]$, $\mathrm{Majority}(\alpha, Q)$,
can be found in time $O(t \log m \log^2 n)$, with a structure using
$n((2 + 1/t) \log n + \log m)$ bits, for any $t \ge 1$ and $m = \min(n, W)$.

Proof. We say that a set of values $C$ is a set of $\alpha$-candidates for
$S' \subseteq S$ if each $\alpha$-majority value $w$ of $S'$ belongs to $C$. In
every wavelet tree node $v$ we store an auxiliary data structure $A(v)$ that
corresponds to elements of $W(v)$. The data structure $A(v)$ enables us to find
the set of $\alpha$-candidates for any range $[r_1 \cdot s, r_2 \cdot s]$ in
$W(v)$, for a parameter $s = t \log m$. We implement $A(v)$ as a balanced binary
range tree $T(v)$ on $W(v)$. Every leaf of $T(v)$ corresponds to an interval of
$W(v)$ of size $s/\alpha$. The range of an internal node $w$ of $T$ is the union
of ranges associated with its children. In every such node $w$, we store all
$\alpha$-majority values for the range of $w$ (these are at most $1/\alpha$
values). The space required by $T(v)$ is
$(2|W(v)|/(s/\alpha))(1/\alpha) \log m = O(|W(v)|/t)$ bits, which added over all
the wavelet tree nodes $v$ sums to $(n \log n)/t$ bits.

Given an interval $[r_1 \cdot t, r_2 \cdot t]$, we can represent it as a union
of $O(\log n)$ ranges for nodes $w_i \in T(v)$. If a value is an
$\alpha$-majority value for $[r_1 \cdot t, r_2 \cdot t]$, then it is an
$\alpha$-majority value for at least one $w_i$. Hence, a candidate set for
$[r_1 \cdot t, r_2 \cdot t]$ is the union of values stored in the nodes $w_i$.
The candidate set contains $O((1/\alpha) \log n)$ values and can be found in
$O((1/\alpha) \log n)$ time.
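The candidate argument can be illustrated with a plain one-dimensional range tree. In this Python sketch the leaves cover single positions rather than intervals of size $s/\alpha$, ranges are half-open, candidate frequencies are verified by brute force instead of with counting grids, and the names `build`, `candidates` and `majorities` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def build(values, alpha, lo, hi, tree, node=1):
    """Store at each node the alpha-majorities of values[lo:hi]."""
    counts = Counter(values[lo:hi])
    tree[node] = [c for c, f in counts.items() if f > alpha * (hi - lo)]
    if hi - lo > 1:
        mid = (lo + hi) // 2
        build(values, alpha, lo, mid, tree, 2 * node)
        build(values, alpha, mid, hi, tree, 2 * node + 1)

def candidates(tree, lo, hi, i, j, node=1):
    """Union of stored majorities over the canonical nodes covering [i, j)."""
    if j <= lo or hi <= i:
        return set()
    if i <= lo and hi <= j:
        return set(tree[node])
    mid = (lo + hi) // 2
    return (candidates(tree, lo, mid, i, j, 2 * node)
            | candidates(tree, mid, hi, i, j, 2 * node + 1))

def majorities(values, tree, alpha, i, j):
    """alpha-majorities of values[i:j]: collect candidates, then verify each."""
    window = values[i:j]
    cand = candidates(tree, 0, len(values), i, j)
    return sorted(c for c in cand if window.count(c) > alpha * len(window))
```

The sketch relies on the same observation as the proof: a value that is an $\alpha$-majority of a union of disjoint ranges must be an $\alpha$-majority of at least one of them.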

Moreover, for every value $c$, we store the grid $G(c)$ of Lemma 2, which
enables us to find the total number of elements with value $c$ in any range
$Q = [x_0, x_1] \times [y_0, y_1]$. Each $G(c)$, however, needs to have the
coordinates mapped to the rows and columns that contain elements with value $c$.
We store sequences $X_c$ and $Y_c$ giving the value identifiers of the points
sorted by $x$- and $y$-coordinates, respectively. By representing them as
wavelet trees, they take $2n \log m + o(n)$ bits of space and map any range
$[x_0, x_1]$ or $[y_0, y_1]$ using a constant number of rank operations on the
sequences, which take $O(\log m)$ time on the wavelet trees. Then the local
grids, which overall occupy other
$\sum_{c \in [1, m]} n_c \log n_c + o(n_c) \le n \log(n/m) + o(n)$ bits,
complete the range counting query in time $O(\log n)$. So the total space of
these range counting structures is $n \log n + n \log m + o(n)$.

To solve an $\alpha$-majority query in a range
$Q = [x_0, x_1] \times [y_0, y_1]$, we visit each node
$v \in \mathrm{imp}(y_0, y_1)$. We identify the longest interval
$[r_1 \cdot s, r_2 \cdot s] \subseteq [R(\mathrm{Root}, v, x_0), R(\mathrm{Root}, v, x_1)]$.
Using $A(v)$ the candidate values in $[r_1 \cdot s, r_2 \cdot s]$ can be found
in time $O((1/\alpha) \log n)$. Then we obtain the values of the elements in
$[R(\mathrm{Root}, v, x_0), r_1 \cdot s)$ and
$(r_2 \cdot s, R(\mathrm{Root}, v, x_1)]$, in time $O(s \log n)$ by traversing
the wavelet tree. Added over all the $v \in \mathrm{imp}(y_0, y_1)$, the cost to
find the $(1/\alpha + s) \log n$ candidates is
$O((1/\alpha + s \log n) \log n)$. Then their frequencies in $Q$ are counted
using the grids $G(c)$ in time $O((1/\alpha + s) \log^2 n)$, and the
$\alpha$-majorities are finally identified.

Thus the overall time is $O(t \log m \log^2 n)$. The space is
$n(2 \log n + \log m + (\log n)/t)$, higher than for the previous problems but
less than the structures to come. □

A slightly better (but messier) time complexity,
$O(t \log m \log n\,(\min(t \log m, \log n) + \log n/\log\log n))$, can be
obtained by using the counting structure of Bose et al. [BHMM09] instead of that
of Lemma 2, and storing value identifiers every $s$ tree levels. The space
increases by $o(n \log(n/m))$. On the other hand, by using $s = t = 1$ we
increase the space to $O(n \log n)$ integers and reduce the query time to
$O(\log^2 n)$.



7. Range Median and Quantiles 

We compute the median element, or more generally, the $k$-th smallest value
$w(p)$ in an area $Q = [x_0, x_1] \times [y_0, y_1]$ (the median corresponds to
$k = \mathrm{Count}(Q)/2$).

From now on we use a different wavelet tree decomposition, on the universe
$[0, m)$ of $w(\cdot)$ values rather than on $y$ coordinates. This can be seen
as a wavelet tree on grids rather than on sequences: the node $v$ of height
$h(v)$ stores a grid $G(v)$ with the points $p \in P$ such that
$\lfloor w(p)/2^{h(v)} \rfloor = L(v\{..\lceil \log m \rceil - h(v)\})$.
Note that each leaf $c$ stores the points $p$ with value $w(p) = c$.

Theorem 11. Given $n$ two-dimensional points with associated values in
$[0, W)$, the $k$-th smallest value of points within a query rectangle
$Q = [x_0, x_1] \times [y_0, y_1]$, $\mathrm{Quantile}(k, Q)$, can be found in
time $O(\ell \log n \log_\ell m)$, with a structure using
$n \log n \log_\ell m + O(n \log m)$ bits, for any $\ell \in [2, m]$ and
$m = \min(n, W)$.

Proof. We use the wavelet tree on grids just described, representing each grid
$G(v)$ with the structure of Lemma 2. To solve this query we start at the root
of the wavelet tree of grids and consider its left child, $v$. If
$t = \mathrm{Count}(Q) \ge k$ on grid $G(v)$, we continue the search on $v$.
Otherwise we continue the search on the right child of the root, with parameter
$k - t$. When we arrive at a leaf corresponding to value $c$, then $c$ is the
$k$-th smallest value in $P \cap Q$.

Notice that we need to reduce the query rectangle to each of the grids $G(v)$
found in the way. We store the $X$ and $Y$ arrays only for the root grid, which
contains the whole $P$. For this and each other grid $G(v)$, we store a bitmap
$X(v)$ so that $X(v)[i] = b$ iff the $i$-th point in $x$-order is stored at the
$b$-child of $v$. Similarly, we store a bitmap $Y(v)$ with the same bits in
$y$-order. Therefore, when we descend to the $b$-child of $v$, for
$b \in \{0, 1\}$, we remap $x_0$ to $\mathrm{Rank}(X(v), b, x_0)$ and $x_1$ to
$\mathrm{Rank}(X(v), b, x_1 + 1) - 1$, and analogously for $y_0$ and $y_1$ with
$Y(v)$.
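The remapping can be checked on a small bitmap. The Python sketch below assumes 0-based positions and takes Rank(B, b, i) to count the occurrences of b among the first i bits; both conventions, as well as the names `rank` and `remap` and the linear-time scan, are assumptions of this illustration.

```python
def rank(bits, b, i):
    """Number of occurrences of bit b in bits[0:i] (linear-time stand-in)."""
    return sum(1 for x in bits[:i] if x == b)

def remap(bits, b, x0, x1):
    """Map the closed range [x0, x1] at a node to positions in its b-child."""
    return rank(bits, b, x0), rank(bits, b, x1 + 1) - 1

X = [0, 1, 1, 0, 1, 0, 0, 1]
ones_range = remap(X, 1, 2, 5)    # positions 2 and 4 go to the 1-child
zeros_range = remap(X, 0, 2, 5)   # positions 3 and 5 go to the 0-child
```

An empty result shows up as a range whose left end exceeds its right end, for instance when no point of the range goes to the chosen child.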

The bitmaps $X(v)$ and $Y(v)$ add up to $O(n \log m)$ bits of space. For the
grids, consider that each point in each grid contributes at most
$\log n + o(1)$ bits, and each $p \in P$ appears in $\lceil \log m \rceil - 1$
grids (as the root grid is not really necessary).

To reduce space, we store the grids $G(v)$ only every $\lceil \log \ell \rceil$
levels (the bitmaps $X(v)$ and $Y(v)$ are still stored for all the levels). This
gives the promised space. For the time, the first decision on the root requires
computing up to $\ell$ operations $\mathrm{Count}(Q)$, but this gives sufficient
information to directly descend $\log \ell$ levels. Thus total time adds up to
$O(\ell \log n \log_\ell m)$. □
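The descent of the proof can be emulated without any succinct structure. In the Python sketch below, filtering a list stands in for Count(Q) on the grids $G(v)$, values are assumed to fit in a given number of bits, and the function name `quantile` and the tuple layout `(x, y, w)` are hypothetical.

```python
def quantile(points, rect, k, bits):
    """k-th smallest value (k >= 1) among points (x, y, w) inside rect, w < 2**bits."""
    x0, x1, y0, y1 = rect
    # Brute-force stand-in for Count(Q): keep only the values of points inside Q.
    inside = [w for x, y, w in points if x0 <= x <= x1 and y0 <= y <= y1]
    if not 1 <= k <= len(inside):
        return None
    prefix = 0
    for h in range(bits - 1, -1, -1):
        # The left child holds the values whose next bit is 0.
        left = [w for w in inside if not (w >> h) & 1]
        t = len(left)
        if t >= k:
            inside = left                      # answer is in the left child
        else:
            inside = [w for w in inside if (w >> h) & 1]
            k -= t                             # skip the t smaller values
            prefix |= 1 << h
    return prefix
```

At each level the search keeps the invariant that the answer is the $k$-th smallest value among the points of $Q$ stored below the current node.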

Again, by replacing our structure of Lemma 2 by Bose et al.'s counting
structure [BHMM09], the time drops to
$O(\ell \log n \log_\ell m/\log\log n)$ when using
$n \log n \log_\ell m\,(1 + o(1)) + O(n \log m)$ bits of space.

The basic wavelet tree structure allows us to count the number of points
$p \in Q$ whose values $w(p)$ fall in a given range $[w_0, w_1]$, within time
$O(\ell \log n \log_\ell m)$ or $O(\ell \log n \log_\ell m/\log\log n)$. This is
another useful operation for data analysis, and can be obtained with the formula
$\sum_{v \in \mathrm{imp}(w_0, w_1)} \mathrm{Count}(Q)$.

As a curiosity, we have tried, just as done in Sections 4 and 5, to build
a wavelet tree on the $y$-coordinates and use a one-dimensional data structure.
We used the optimal linear-space structure of Brodal and Jørgensen.
However, the result is not competitive with the one we have achieved by building
a wavelet tree on the domain of point values.



8. Range Majority for Variable $\alpha$

We can solve this problem, where $\alpha$ is specified at query time, with the
same structure used for Theorem 11.

Theorem 12. The structures of Theorem 11 can compute all the
$\alpha$-majorities of the point values inside $Q$,
$\mathrm{Majority}(\alpha, Q)$, in time
$O(\frac{1}{\alpha}\, \ell \log n \log_\ell m)$, where $\alpha$ can be chosen at
query time.

Proof. For $\alpha \ge \frac{1}{2}$ we find the median $c$ of $Q$ and then use
the leaf $c$ to count its frequency in $Q$. If this is more than
$\alpha \cdot \mathrm{Count}(Q)$, then $c$ is the answer, else there is no
$\alpha$-majority. For $\alpha < \frac{1}{2}$, we solve the query by probing all
the $(i \cdot \alpha)\,\mathrm{Count}(Q)$-th elements in $Q$. □
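The probing strategy can be sketched as follows. In this Python illustration a sorted list plays the role of Quantile queries, counting in the list replaces the frequency count at a leaf, the same probing is used for every $\alpha$, and probe ranks are rounded up; the rounding rule and the name `majorities_by_probing` are assumptions of the sketch.

```python
from fractions import Fraction
from math import ceil

def majorities_by_probing(values_in_q, alpha):
    """All values occurring more than alpha * q times among the q values in Q."""
    ordered = sorted(values_in_q)      # stand-in: ordered[k - 1] plays Quantile(k, Q)
    q = len(ordered)
    alpha = Fraction(alpha)
    found = set()
    i = 1
    while ceil(i * alpha * q) <= q and q > 0:
        c = ordered[ceil(i * alpha * q) - 1]   # the (i * alpha * q)-th element, rounded up
        if ordered.count(c) > alpha * q:       # stand-in for counting at leaf c
            found.add(c)
        i += 1
    return sorted(found)
```

The probes suffice because an $\alpha$-majority occupies more than $\alpha \cdot \mathrm{Count}(Q)$ consecutive ranks in sorted order, so at least one probed rank falls inside its run.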

Once again, we attempted to build a wavelet tree on $y$-coordinates, using
the one-dimensional structure of Durocher et al. [DHM+11] at each level, but
we obtain inferior results.

Culpepper et al. [CNPT10] show how to find the mode, and in general the
$k$ most repeated values inside $Q$, using successively more refined Quantile
queries. Let the $k$-th most repeated value occur
$\alpha \cdot \mathrm{Count}(Q)$ times in $Q$, then we require at most
$4/\alpha$ quantile queries [CNPT10]. The same result can be obtained by probing
successive values $\alpha = 1/2^i$ with $\mathrm{Majority}(\alpha)$ queries.



9. Range Successor and Predecessor 

The successor (predecessor) of a value $w$ in a rectangle
$Q = [x_0, x_1] \times [y_0, y_1]$ is the smallest (largest) value larger
(smaller) than, or equal to, $w$ in $Q$. We also have an efficient solution
using our wavelet trees on grids.

Theorem 13. The structures of Theorem 11 can compute the successor and
predecessor of a value $w$ within the values of the points inside $Q$,
$\mathrm{Succ}(w, Q)$ and $\mathrm{Pred}(w, Q)$, in time
$O(\ell \log n \log_\ell m)$.

Proof. We consider the nodes $v \in \mathrm{imp}(w, +\infty)$ from left to
right, tracking rectangle $Q$ in the process. The condition for continuing the
search below a node $v$ that is in $\mathrm{imp}(w, +\infty)$, or is a
descendant of one such node, is that $\mathrm{Count}(Q) > 0$ on $G(v)$.
$\mathrm{Succ}(w, Q)$ is the value associated with the first leaf found by this
process. Likewise, $\mathrm{Pred}(w, Q)$ is computed by searching
$\mathrm{imp}(-\infty, w)$ from right to left. To reduce space we store the
grids only every $\lceil \log \ell \rceil$ levels, and thus determining whether
a child has a point in $Q$ may cost up to $O(\ell \log n)$. Yet, as for
Theorem 11, the total time amortizes to $O(\ell \log n \log_\ell m)$. □
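The left-to-right search can be emulated over the value universe. In this Python sketch, list filtering stands in for the emptiness test Count(Q) > 0 on each grid, the universe is $[0, 2^{bits})$, and the names `successor` and `search` are hypothetical.

```python
def successor(points, rect, w, bits):
    """Smallest value >= w among points (x, y, value) inside rect, or None."""
    x0, x1, y0, y1 = rect
    inside = [v for x, y, v in points if x0 <= x <= x1 and y0 <= y <= y1]

    def search(lo, hi, vals):
        # Node covers the value range [lo, hi); vals are the points of Q stored there.
        if not vals or hi <= w:        # Count(Q) = 0 here, or node entirely below w
            return None
        if hi - lo == 1:
            return lo                  # first leaf reached, scanning left to right
        mid = (lo + hi) // 2
        left = search(lo, mid, [v for v in vals if v < mid])
        if left is not None:
            return left
        return search(mid, hi, [v for v in vals if v >= mid])

    return search(0, 1 << bits, inside)
```

The predecessor is symmetric: the children are visited from right to left and nodes entirely above $w$ are pruned.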



Once again, storing one-dimensional data structures [CIK+08, MNU05] on
a $y$-coordinate-based wavelet tree does not yield competitive results.



10. Dynamism 

Our dynamic wavelet tree of Lemma 5 supports range counting and point
insertions/deletions on a fixed grid in time $O(\log U \log n/\log\log n)$
(other tradeoffs exist [Nek09]). If we likewise assume that our grid is fixed in
Theorems 11, 12 and 13, we can also support point insertions and deletions (and
thus changing the value of a point).

Theorem 14. Given $n$ points on a $U \times U$ grid, with associated values in
$[0, W)$, there is a structure using $n \log U \log_\ell W\,(1 + o(1))$ bits,
for any $\ell \in [2, W]$, that answers the queries Quantile, Succ and Pred in
time $O(\ell \log U \log n \log_\ell W/\log\log n)$, and the
$\mathrm{Majority}(\alpha)$ operations in time
$O(\frac{1}{\alpha}\, \ell \log U \log n \log_\ell W/\log\log n)$. It supports
point insertions and deletions, and value updates, in time
$O(\log U \log n \log_\ell W/\log\log n)$.

Proof. We use the data structure of Theorems 11, 12 and 13, modified as
follows. We build the wavelet tree on the universe $[0, W)$ and thus do not map
the universe values to rank space. The grids $G(v)$ use the dynamic structure of
Lemma 5 on global $y$-coordinates $[0, U)$. We maintain the global array $X$ of
Lemma 5 plus the vectors $X(v)$ of Theorem 11, the latter using dynamic bitmaps
[HM10, NS10]. The time for the queries follows immediately. For updates we track
down the point to insert/delete across the wavelet tree, inserting or deleting
it in each grid $G(v)$ found in the way, and also in the corresponding vector
$X(v)$. □



11. Conclusions 

We have demonstrated how wavelet trees [GGV03] can be used for solving a
wide range of two-dimensional queries that are useful for various data analysis
activities. Wavelet trees have the virtue of using little space. By enriching
them with further sparsified data, we support various complex queries in
polylogarithmic time and linear space, sometimes even succinct. Other more
complicated queries require slightly superlinear space.

We believe this work just opens the door to the possible applications to data
analysis, and that many other queries may be of interest. A prominent one
lacking good solutions is to find the mode, that is, the most frequent value, in
a rectangle, and its generalization to the top-$k$ most frequent values. There
has been some recent progress on the one-dimensional version and even in two
dimensions [DM11], but the results are far from satisfactory.

Another interesting open problem is how to support dynamism while re- 
taining time complexities logarithmic in the number of points and not in the 
grid size. This is related to the problem of dynamic wavelet trees, in partic- 
ular supporting insertion and deletion of y-coordinates (on which they build 
the partition). Dynamic wavelet trees would also solve many problems in other 
areas. 

Finally, a natural question is what lower bounds relate the achievable space
and time complexities for the data analysis queries we have considered. Such
bounds are well known for the more typical counting and reporting queries, but
not for these less understood ones.
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Appendix A. Optimal-Space Representation of Grids 

We analyze the representation described in Section 2.3, showing how it can
be made near-optimal in the information-theoretic sense. Recall that our
representation of a set of points of $[0, U)^2$ consists in storing two sorted
arrays $X$ and $Y$, which reduce the $[0, U)$ values to $[0, n)$. The points in
the $[0, n) \times [0, n)$ grid have exactly one point per row and one per
column.

An optimal-space representation of the above data uses the data structure of
Okanohara and Sadakane [OS07] for mapping the sorted $X$ coordinates (where
point $X(i)$ is represented as a bit set at position $i + X(i)$ in a bitmap of
length $n + U$), a similar structure for the $Y$ coordinates, and a wavelet tree
for the grid of mapped points. The former occupies
$n \log \frac{U+n}{n} + O(n) = \log \binom{U+n}{n} + O(n)$ bits of space, gives
constant-time access to the real coordinate of any point
$X(i) = \mathrm{select}_1(i) - i$, and takes $O(\log \frac{U+n}{n})$ time to map
any value $x$ to rank space at query time; and similarly for $Y$. The wavelet
tree requires $n \log n + o(n) = \log n! + O(n)$ bits. Overall, if we ignore the
$O(n)$-bit redundancies, the total space is
$\log\left(n! \binom{U+n}{n}^2\right)$ bits.

Hence our representation can be in any of $n! \binom{U+n}{n}^2$ configurations.
Note that we can represent repeated points, which is useful in some cases,
especially when they can have associated values. We show now that our number of
configurations is not much more than the number of possible configurations of
$P$ even if repeated points are forbidden, i.e., $n$ distinct points from
$[0, U)^2$ can be in $\binom{U^2}{n} \le \binom{U+n}{n}^2 n!$ configurations.
The difference is not so large because
$\binom{U+n}{n}^2 n! \le \binom{U^2}{n} c^n$, for any $c \ge 4$. In terms of
bits this means that
$2 \log \binom{U+n}{n} + \log n! \le \log \binom{U^2}{n} + n \log c$ and
therefore our representation is at most $O(n)$ bits larger than an optimal
representation, aside from the $O(n)$ bits we are already wasting with respect
to $\log\left(n! \binom{U+n}{n}^2\right)$.

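To see this, notice that
$\binom{U+n}{n}^2 n! = \left((U+n)!/(n!\,U!)\right)^2 n! = \left((U+n)!/U!\right)^2/n! = \left(\prod_{i=0}^{n-1} (U+n-i)^2\right)/n!$.
For sufficiently large $c$, this is
$\le \left(\prod_{i=0}^{n-1} c\,(U^2 - i)\right)/n! = c^n\, U^2!/((U^2-n)!\,n!) = \binom{U^2}{n} c^n$.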

We need $c \ge (U+n-i)^2/(U^2-i)$ for any $0 \le i < n$. We next show that,
if $n \le U$, then $(U+n)^2/U^2 \ge (U+n-i)^2/(U^2-i)$, and thus it is enough to
choose $c \ge (U+n)^2/U^2$. Simple algebra shows that the condition is
equivalent to $i \le 2(U+n) - (1 + (n/U))^2$. Since we assume for now that
$n/U \le 1$, the inequality is satisfied if $i \le 2(U+n) - 4$. Since $i < n$,
the inequality always holds (as $U, n \ge 1$). Thus it is sufficient that
$c \ge (U+n)^2/U^2$, which is no larger than 4 if $n \le U$.

Let us now consider the case $n > U$. Our analysis still holds, up to large
enough $n$. Since $i \le n - 1$, it is sufficient that
$n - 1 \le 2(U+n) - (1 + (n/U))^2$. Simple algebra shows this is equivalent to
the cubic inequality $2U^3 + nU^2 - 2nU - n^2 \ge 0$. As a function of $U$ this
function has three roots, the only positive one at $U = \sqrt{n}$. Therefore it
is positive for $U \ge \sqrt{n}$, i.e., $n \le U^2$, which covers all the
possible values of $n$.
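The two counting bounds, and the factorization behind the cubic inequality, can be checked numerically on small grids. The Python sketch below is only a sanity check for the case $n \le U$ with $c = 4$; the function names are hypothetical, and the factorization $(U^2 - n)(2U + n)$ of the cubic is stated here as an aid and verified by the code.

```python
from math import comb, factorial

def configurations(U, n):
    """Number of states of the representation: n! * C(U+n, n)^2."""
    return factorial(n) * comb(U + n, n) ** 2

def distinct_point_sets(U, n):
    """Number of sets of n distinct points on a U x U grid: C(U^2, n)."""
    return comb(U * U, n)

def cubic(U, n):
    """Left-hand side of the cubic inequality 2U^3 + nU^2 - 2nU - n^2 >= 0."""
    return 2 * U ** 3 + n * U ** 2 - 2 * n * U - n ** 2

# Both bounds hold on every small instance with n <= U, using c = 4.
bounds_hold = all(
    distinct_point_sets(U, n) <= configurations(U, n) <= distinct_point_sets(U, n) * 4 ** n
    for U in range(1, 13) for n in range(1, U + 1)
)
```

The cubic factors as $(U^2 - n)(2U + n)$, which makes its only positive root $U = \sqrt{n}$ apparent.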



Appendix B. Reducing Space for Variance 

We now discuss how to bring the $2\lceil \log W \rceil$ space factor associated
to storing weights $w'(p) = w(p)^2$ closer to $\lceil \log V \rceil$, where $V$
is the overall variance.

Instead of storing $w'(p)$, we can store
$w''(p) = (w(p) - \lceil T/n \rceil)^2$, where $T/n$ is the average of all the
points. To obtain $\mathrm{Var}(Q)$ from the sum of the $w''$ in $Q$ notice that
$(w(p) - T_Q/q)^2 = (w(p) - \lceil T/n \rceil)^2 + 2(w(p) - \lceil T/n \rceil)(\lceil T/n \rceil - T_Q/q) + (\lceil T/n \rceil - T_Q/q)^2$,
where $T_Q = \mathrm{Sum}(Q)$ and $q = \mathrm{Count}(Q)$. This formula makes
use of the stored $w''(p)$ values, as well as queries $\mathrm{Sum}(Q)$ and
$\mathrm{Count}(Q)$. Note that rounding is used to keep the values as integers,
hence limiting the number of bits necessary in their representation.
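Expanding $(w(p) - T_Q/q)^2$ around the pivot $\lceil T/n \rceil$ and summing over the $q$ points of $Q$, the cross term contributes $-2q(\lceil T/n \rceil - T_Q/q)^2$, so that $\mathrm{Var}(Q) = \frac{1}{q}\sum_{p \in Q} w''(p) - (\lceil T/n \rceil - T_Q/q)^2$. The Python sketch below checks this consequence on a small example; the function name `var_from_shifted` is hypothetical and exact rationals are used to avoid rounding.

```python
from fractions import Fraction
from math import ceil

def var_from_shifted(values_in_q, global_avg):
    """Var(Q) from the shifted squares w''(p) = (w(p) - ceil(T/n))^2."""
    b = ceil(global_avg)                             # integer pivot, ceil(T/n)
    q = len(values_in_q)                             # Count(Q)
    t_q = sum(values_in_q)                           # Sum(Q)
    s2 = sum((w - b) ** 2 for w in values_in_q)      # sum of the stored w'' inside Q
    return Fraction(s2, q) - (b - Fraction(t_q, q)) ** 2
```

The result does not depend on the pivot, which is why a local average can be substituted for the global one without affecting correctness.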

To avoid numeric instability and wasted space, it is better that $T/n$ is close
to $T_Q/q$. This simultaneously yields smaller
$(w(p) - \lceil T/n \rceil)^2$ values and reduces the
$(\lceil T/n \rceil - T_Q/q)$ factor in the subtraction. To ensure this we may
(logically) partition the space and use the local average, instead of the global
one. Each level of the wavelet tree partitions the space into horizontal
non-overlapping bands, of the form $[j \cdot n/2^k, (j+1) \cdot n/2^k)$, for
some $k$. At every level we use the average of the band in question. This allows
us to compute variances for rectangles whose $y$ coordinates are aligned with
the bands, while the $x$ coordinates are not restricted. For a general rectangle
$Q$ we decompose it into band-aligned rectangles, just as with any other query
on the wavelet tree (recall Equation (1)). Alternatively, we can use a variance
update formula [CGL79, Wel62] that is stable and further reduces the instability
of the first calculation. We rewrite the update formula of Chan et al. [CGL79]
in terms of sets, since originally it was formulated in one dimension.

Lemma 15 ([CGL79, Eq. 2.1]). Given two disjoint sets $A$ and $B$, which have
values $w(p)$ associated with element $p$, where $T_A = \mathrm{Sum}(A)$,
$T_B = \mathrm{Sum}(B)$, $m = \mathrm{Count}(A)$, $n = \mathrm{Count}(B)$,
$S_A = \sum_{p \in A} (w(p) - T_A/m)^2$ and
$S_B = \sum_{p \in B} (w(p) - T_B/n)^2$, the following equalities hold:

$$T_{A \cup B} = T_A + T_B \qquad \text{(B.1)}$$

$$S_{A \cup B} = S_A + S_B + \frac{m}{n(m+n)} \left(\frac{n}{m} T_A - T_B\right)^2 \qquad \text{(B.2)}$$

Notice that $\mathrm{Var}(A) = S_A/m$.
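The update formula can be checked numerically. The Python sketch below merges the statistics of two disjoint sets with the formula of Chan et al. and compares the result with a direct computation; the tuple layout `(count, T, S)` and the function names `stats` and `merge` are assumptions of this illustration.

```python
from fractions import Fraction

def stats(values):
    """(count, T, S): count, sum, and sum of squared deviations from the mean."""
    m, t = len(values), sum(values)
    s = sum((Fraction(w) - Fraction(t, m)) ** 2 for w in values)
    return m, t, s

def merge(a, b):
    """Combine the statistics of two disjoint sets with Eqs. (B.1) and (B.2)."""
    (m, ta, sa), (n, tb, sb) = a, b
    t = ta + tb
    s = sa + sb + Fraction(m, n * (m + n)) * (Fraction(n, m) * ta - tb) ** 2
    return m + n, t, s
```

Dividing the merged $S$ by the merged count gives the variance of the union, so variances of band-aligned rectangles can be combined without revisiting the points.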
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